Traffic flow prediction is an important part of smart transportation. The goal is to predict future traffic conditions from historical data recorded by sensors and from the traffic network. As a city continues to grow, parts of its transportation network are added or modified, so accurately forecasting over an expanding and evolving long-term streaming network is of great significance. To this end, we propose a new simulation-based criterion that teaches autonomous agents to mimic sensor patterns, planning their next visit based on the sensor's profile (e.g., traffic, speed, occupancy). The data recorded by the sensor is most accurate when the agent can perfectly simulate the sensor's activity pattern. We formulate the problem as a continuous reinforcement learning task, where the agent is the next-flow-value predictor, the action is the next time-series flow value at the sensor, and the environment state is a dynamically fused representation of the sensor and the transportation network. Actions taken by the agent change the environment, which in turn forces the agent's mode to update, while the agent further explores changes in the dynamic traffic network, which helps it predict its next visit more accurately. We therefore develop a strategy in which sensors and traffic networks update each other, and we incorporate temporal context to quantify state representations as they evolve over time.
As the basis for prehensile manipulation, it is vital to enable robots to grasp as robustly as humans. In daily manipulation, the human grasping system is prompt, accurate, flexible and continuous across spatial and temporal domains. Few existing methods cover all these properties for robot grasping. In this paper, we propose a new methodology for grasp perception that gives robots these abilities. Specifically, we develop a dense supervision strategy with real perception and analytic labels in the spatial-temporal domain. Additional awareness of objects' center-of-mass is incorporated into the learning process to help improve grasping stability. Utilizing grasp correspondence across observations enables dynamic grasp tracking. Our model, AnyGrasp, can efficiently generate accurate, full-DoF, dense and temporally smooth grasp poses, and it works robustly against large depth-sensing noise. With AnyGrasp embedded, we achieve a 93.3% success rate when clearing bins with over 300 unseen objects, which is comparable with human subjects under controlled conditions. Over 900 MPPH is reported on a single-arm system. For dynamic grasping, we demonstrate catching swimming robot fish in the water.
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task in computer vision. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without any further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel framework called CLIP-ES for WSSS. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the CAM refinement stage, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to mitigate noise and focus on confident regions. Our proposed framework dramatically reduces the cost of training for WSSS and shows the capability of localizing objects in CLIP. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
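Estimating the 6D pose of unseen objects is in great demand for many real-world applications. However, current state-of-the-art pose estimation methods can only handle objects they were previously trained on. In this paper, we propose a new task that enables algorithms to estimate the 6D pose of novel objects at test time. We collect a dataset with both real and synthetic images and up to 48 unseen objects in the test set. We also propose a new metric named Infimum ADD (IADD), which is an invariant measurement for objects with different types of pose ambiguity. A two-stage baseline solution for this task is provided as well. By training an end-to-end 3D correspondence network, our method accurately and efficiently finds corresponding points between an unseen object and a partial-view RGBD image. It then computes the 6D pose from the correspondences using an algorithm that is robust to object symmetry. Extensive experiments show that our method outperforms several intuitive baselines, verifying its effectiveness. All data, code and models will be made publicly available. Project page: www.graspnet.net/unseen6d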
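The abstract names the IADD metric without defining it. As a rough illustration only, the sketch below assumes one plausible reading of the name: take the standard ADD distance (mean distance between model points under two poses) and minimize it over a finite set of ground-truth poses that are indistinguishable because of object ambiguity. The function names, the `(rotation, translation)` pose layout and the finite set `equivalent_gt_poses` are all hypothetical and are not taken from the paper.

```python
import math

def transform(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation as nested lists, 3-vector translation)."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]

def add_metric(model_points, pose_a, pose_b):
    """ADD: mean distance between model points transformed by two poses."""
    total = 0.0
    for p in model_points:
        total += math.dist(transform(*pose_a, p), transform(*pose_b, p))
    return total / len(model_points)

def infimum_add(model_points, predicted_pose, equivalent_gt_poses):
    """Assumed reading of IADD: smallest ADD over a finite set of equivalent poses."""
    return min(
        add_metric(model_points, predicted_pose, gt) for gt in equivalent_gt_poses
    )

# A two-point object that looks identical after a 180 degree turn about z.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot180z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
zero = [0.0, 0.0, 0.0]
points = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]

plain = add_metric(points, (rot180z, zero), (identity, zero))
invariant = infimum_add(points, (rot180z, zero), [(identity, zero), (rot180z, zero)])
```

In this toy case plain ADD penalizes the flipped prediction even though the object looks the same, while the minimum over equivalent poses does not. That is the kind of invariance the abstract attributes to IADD.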
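We are now witnessing significant progress of deep learning methods across a variety of protein tasks (or datasets). However, there is no standard benchmark for evaluating the performance of different methods, which hinders the progress of deep learning in this field. In this paper, we propose such a benchmark, called PEER, a comprehensive and multi-task benchmark for protein sequence understanding. PEER provides a diverse set of protein understanding tasks, including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction. We evaluate different types of sequence-based methods on each task, including traditional feature engineering approaches, different sequence encoding methods, and large-scale pre-trained protein language models. In addition, we investigate the performance of these methods under the multi-task learning setting. Experimental results show that large-scale pre-trained protein language models achieve the best performance on most individual tasks, and that jointly training multiple tasks further boosts performance. The datasets and source code of this benchmark are available at https://github.com/deepgraphlearning/peer_benchmark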
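Recent years have witnessed a surge of learned representations built directly on point clouds. Although increasingly expressive, most existing representations still struggle to generate ordered point sets. Inspired by spherical multi-view scanners, we propose a novel sampling model called Spotlights that represents a 3D shape as a compact 1D array of depth values. It simulates a configuration of cameras evenly distributed on a sphere, where each virtual camera casts rays from its principal point through sample points on a small concentric spherical cap to probe for possible intersections with the object enclosed by the sphere. The structured point cloud is thus given implicitly as a function of depths. We provide a detailed geometric analysis of this new sampling scheme and demonstrate its effectiveness in the context of the point cloud completion task. Experimental results on both synthetic and real data show that our method achieves competitive accuracy and consistency while significantly reducing computational cost. Furthermore, we show superior performance over state-of-the-art completion methods on the downstream point cloud registration task.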
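We introduce a dataset of 998 3D models of everyday tabletop objects together with 847,000 real-world RGB and depth images of them. Accurate annotations of the camera pose and object pose for each image are produced in a semi-automated fashion to facilitate use of the dataset in many 3D applications, such as shape reconstruction, object pose estimation, and shape retrieval. We focus on 3D reconstruction, since the task lacks an appropriate real-world benchmark, and demonstrate that our dataset can fill that gap. The entire annotated dataset, along with the source code for the annotation tools and evaluation baselines, is available at http://www.ocrtoc.org/3d-reconstruction.html.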
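Vision transformers have shown great visual representation power in a substantial range of vision tasks such as recognition and detection, and have therefore attracted fast-growing efforts to manually design more effective architectures. In this paper, we propose to use neural architecture search to automate this process, searching not only the architecture but also the search space. The central idea is to gradually evolve the different search dimensions, guided by their E-T Error computed with a weight-sharing supernet. Moreover, we provide design guidelines for general vision transformers, with extensive analysis based on the space-searching process, which could promote the understanding of vision transformers. Remarkably, the models found in the searched space, named S3 (short for Searching the Search Space), achieve superior performance to recently proposed models such as Swin, DeiT and ViT when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https://github.com/microsoft/cream.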
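With the rapid development of deep learning techniques, various recent works have tried to apply graph neural networks (GNNs) to NP-hard problems such as Boolean Satisfiability (SAT), showing the potential to bridge the gap between machine learning and symbolic reasoning. However, the quality of the solutions predicted by GNNs has not been well studied in the literature. In this paper, we study the capability of GNNs in learning to solve the Maximum Satisfiability (MaxSAT) problem, from both theoretical and practical perspectives. We build two kinds of GNN models to learn the solutions of MaxSAT instances from benchmarks, and show through experimental evaluation that GNNs have attractive potential for solving the MaxSAT problem. Based on the theory of algorithmic alignment, we also present a theoretical explanation of why GNNs can learn to solve the MaxSAT problem to a certain extent.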
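For readers unfamiliar with the MaxSAT objective that the GNNs are trained to approximate, a minimal sketch may help: given a CNF formula, find an assignment that satisfies as many clauses as possible. The code below is a brute-force reference for tiny instances, not the GNN approach from the abstract. The DIMACS-style clause encoding (integer `3` for x3, `-3` for NOT x3) and the function names are illustrative assumptions.

```python
from itertools import product

def count_satisfied(clauses, assignment):
    """Count clauses that contain at least one true literal.

    clauses: list of lists of nonzero ints (3 means x3, -3 means NOT x3).
    assignment: dict mapping variable index to bool.
    """
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def brute_force_maxsat(clauses, num_vars):
    """Try every assignment and keep the first one with the highest count."""
    best_count, best_assignment = -1, None
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: v for i, v in enumerate(values)}
        satisfied = count_satisfied(clauses, assignment)
        if satisfied > best_count:
            best_count, best_assignment = satisfied, assignment
    return best_count, best_assignment

# (x1) AND (NOT x1) AND (x1 OR x2) AND (NOT x2): not all four can hold at once.
clauses = [[1], [-1], [1, 2], [-2]]
best_count, best_assignment = brute_force_maxsat(clauses, num_vars=2)
```

The exhaustive loop visits 2^n assignments, which is why learned or heuristic solvers are of interest for larger instances. A predicted assignment can be scored with `count_satisfied` alone.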
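We develop a novel framework that adds sparse group lasso regularizers to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad and AdaHessian, and creates a new class of optimizers, named accordingly Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, Group AdaHessian, and so on. We establish theoretically proven convergence guarantees in the stochastic convex setting, based on primal-dual methods. We evaluate the regularization effect of the new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results show that, compared with the original optimizers followed by a post-processing step that uses magnitude pruning, model performance can be significantly improved at the same sparsity level. Furthermore, compared with the cases without magnitude pruning, our methods can achieve extremely high sparsity with significantly better or highly competitive performance.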
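To make the sparse group lasso penalty concrete, the sketch below shows its textbook proximal operator for a single group of weights. It is not the update rule of the Group optimizers described above, which the abstract does not specify. The penalty is assumed to be `lam_l1 * ||w||_1 + lam_group * ||w||_2` per group, and the function and parameter names are illustrative.

```python
import math

def soft_threshold(x, lam):
    """Shrink x toward zero by lam (proximal operator of lam * |x|)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def prox_sparse_group_lasso(group, lam_l1, lam_group):
    """Proximal step for lam_l1 * ||w||_1 + lam_group * ||w||_2 on one group.

    Step 1 soft-thresholds each weight (element-wise sparsity).
    Step 2 shrinks the whole group, zeroing it if its norm is small
    (group-wise sparsity).
    """
    shrunk = [soft_threshold(x, lam_l1) for x in group]
    norm = math.sqrt(sum(x * x for x in shrunk))
    if norm <= lam_group:
        return [0.0] * len(group)
    scale = 1.0 - lam_group / norm
    return [scale * x for x in shrunk]

# A strong group is only shrunk; a weak group is removed entirely.
kept = prox_sparse_group_lasso([3.0, -4.0], lam_l1=0.0, lam_group=2.5)
dropped = prox_sparse_group_lasso([0.3, -0.2], lam_l1=0.1, lam_group=1.0)
```

Zeroing a whole group at once is what would allow, for example, an entire embedding vector to be dropped rather than only individual coordinates.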